mgmtd: remove bogus "hedge" code which corrupted active candidate DS#18601
Conversation
Say you have 2 mgmtd frontend sessions (2 vtysh's) the first one is long
running and is actively changing the global candidate datastore (DS),
the second one starts and exits, this code would then copy running
back over the candidate, blowing away any changes made by the first
session.
(the long running session could technically be any user)
Instead we need to trust the various cleanup code that already exits.
For example in the commit_cfg_reply on success candidate is copied to
running, and on failure *for implicit commit* running is copied back to
candidate clearing the change. This leaves the non-implicit
configuration changes in this case we actually want candidate to keep
it's changes in transactional cases, in the other case of pending commit
during a file read the code restores candidate (if needed) on exit from
"config terminal", with this call stack:
vty_config_node_exit()
nb_cli_pending_commit_check()
nb_cli_classic_commit()
nb_candidate_commit_prepare() [fail] -> copy running -> candidate
nb_candidate_commit_apply() -> copy candidate -> running
fixes FRRouting#18541
Signed-off-by: Christian Hopps <[email protected]>
|
I'm ... confused about why we should support multiple simultaneous config sessions. How does that work - we don't have ... sequence numbers or uuids or a way of detecting conflicts, do we? Could multiple sessions introduce inconsistencies that would lead to invalid config - could one session do some validation, and then the second introduce a change that invalidates that? |
|
@mjstapp Our normal CLI uses "implicit-commit" mode which means each change takes affect immediately, a lock is obtained immediate prior to making the change and released immediately after. For file loads locks are obtain prior to reading the file and released after reading the file is done. This isn't really the point though. The code being remove is blowing away the candidate config w/o obtaining any lock, and without even entering into a config state (i.e., simply doing a The code is ill-considered "just in case" "cleanup" code that is blatantly wrong. |
mjstapp
left a comment
There was a problem hiding this comment.
I think we agreed in this week's dev meeting that the __ (double-underscore) functions should be renamed; I'll push this afterwards.
Having just completed a code audit during RCA, the fact that we have 2 different argument orders for the related datastore copying functions was unnecessary and super confusing. Fix this code-maintenance/comprehension mistake and move the newer mgmtd copy routines to use the same arg order as the pre-existing underlying northbound copy functions (i.e., use `copy(dst, src)`) Signed-off-by: Christian Hopps <[email protected]>
8eb69b5 to
59d2368
Compare
Ported the fix FRRouting/frr#18601 from FRR community Tested using static routes, save and reboot. K>* 0.0.0.0/0 [0/0] via 10.59.128.1, eth0, 00:03:58 L * 1.1.1.1/32 is directly connected, Loopback0, 00:03:52 C>* 1.1.1.1/32 is directly connected, Loopback0, 00:03:52 C>* 10.59.128.0/20 is directly connected, eth0, 00:03:58 L> 10.59.142.159/32 is directly connected, eth0, 00:03:58 C>* 11.1.0.0/16 is directly connected, vsh0, 00:03:58 L> 11.1.22.238/32 is directly connected, vsh0, 00:03:58 C>* 20.1.1.0/24 is directly connected, Ethernet0, 00:02:22 L> 20.1.1.2/32 is directly connected, Ethernet0, 00:02:22 S>* 40.1.1.1/32 [1/0] is directly connected, Ethernet0, weight 1, 00:00:16 S>* 40.1.1.2/32 [1/0] is directly connected, Ethernet0, weight 1, 00:00:02
Ported the fix FRRouting/frr#18601 from FRR community Tested using static routes, save and reboot. K>* 0.0.0.0/0 [0/0] via 10.59.128.1, eth0, 00:03:58 L * 1.1.1.1/32 is directly connected, Loopback0, 00:03:52 C>* 1.1.1.1/32 is directly connected, Loopback0, 00:03:52 C>* 10.59.128.0/20 is directly connected, eth0, 00:03:58 L> 10.59.142.159/32 is directly connected, eth0, 00:03:58 C>* 11.1.0.0/16 is directly connected, vsh0, 00:03:58 L> 11.1.22.238/32 is directly connected, vsh0, 00:03:58 C>* 20.1.1.0/24 is directly connected, Ethernet0, 00:02:22 L> 20.1.1.2/32 is directly connected, Ethernet0, 00:02:22 S>* 40.1.1.1/32 [1/0] is directly connected, Ethernet0, weight 1, 00:00:16 S>* 40.1.1.2/32 [1/0] is directly connected, Ethernet0, weight 1, 00:00:02 <!-- - Note we only backport fixes to a release branch, *not* features! - Please also provide a reason for the backporting below. - e.g. - [x] 202006 --> - [ ] 201811 - [ ] 201911 - [ ] 202006 - [ ] 202012 - [ ] 202106 - [ ] 202111 - [ ] 202205 - [ ] 202211 - [ ] 202305 #### Tested branch (Please provide the tested image version) FRR 10.0.1
Ported the fix FRRouting/frr#18601 from FRR community Tested using static routes, save and reboot. K>* 0.0.0.0/0 [0/0] via 10.59.128.1, eth0, 00:03:58 L failure_prs.log 1.1.1.1/32 is directly connected, Loopback0, 00:03:52 C>* 1.1.1.1/32 is directly connected, Loopback0, 00:03:52 C>* 10.59.128.0/20 is directly connected, eth0, 00:03:58 L> 10.59.142.159/32 is directly connected, eth0, 00:03:58 C>* 11.1.0.0/16 is directly connected, vsh0, 00:03:58 L> 11.1.22.238/32 is directly connected, vsh0, 00:03:58 C>* 20.1.1.0/24 is directly connected, Ethernet0, 00:02:22 L> 20.1.1.2/32 is directly connected, Ethernet0, 00:02:22 S>* 40.1.1.1/32 [1/0] is directly connected, Ethernet0, weight 1, 00:00:16 S>* 40.1.1.2/32 [1/0] is directly connected, Ethernet0, weight 1, 00:00:02 <!-- - Note we only backport fixes to a release branch, *not* features! - Please also provide a reason for the backporting below. - e.g. - [x] 202006 --> - [ ] 201811 - [ ] 201911 - [ ] 202006 - [ ] 202012 - [ ] 202106 - [ ] 202111 - [ ] 202205 - [ ] 202211 - [ ] 202305 #### Tested branch (Please provide the tested image version) FRR 10.0.1
…ity (#1675) Sync PR #22408 from sonic-net/sonic-buildimage to 202412. Original PR: sonic-net/sonic-buildimage#22408 Co-authored-by: sudhanshukumar22 <[email protected]>
…ity (#1675) Sync PR #22408 from sonic-net/sonic-buildimage to 202412. Original PR: sonic-net/sonic-buildimage#22408 Co-authored-by: sudhanshukumar22 <[email protected]>
Say you have 2 mgmtd frontend sessions (2 vtysh's) the first one is long
running and is actively changing the global candidate datastore (DS),
the second one starts and exits, this code would then copy running
back over the candidate, blowing away any changes made by the first
session.
(the long running session could technically be any user)
Instead we need to trust the various cleanup code that already exits.
For example in the commit_cfg_reply on success candidate is copied to
running, and on failure for implicit commit running is copied back to
candidate clearing the change. This leaves the non-implicit
configuration changes in this case we actually want candidate to keep
it's changes in transactional cases, in the other case of pending commit
during a file read the code restores candidate (if needed) on exit from
"config terminal", with this call stack:
fixes #18541